Using large-scale corpora in text-based research often evokes the hope that automated content analysis and “distant reading” (Moretti 2013) will facilitate generalizations that cannot be attained by the close qualitative analysis of a limited number of texts. Yet as supervised machine learning requires training data that need to be prepared by way of text annotation, the requirement to work with individual texts has not gone away. And if we accept the call to “validate, validate, validate” (Grimmer and Stewart 2013) to reach sound research results, i.e. to corroborate quantitative findings by qualitative means, those who have a quantitative take on texts still need tools for inspecting and annotating texts qualitatively.
The requirement to integrate quantitative and qualitative approaches to text is not new. Indeed, there is a breadth of tools for text annotation, or for “coding” texts, as it is mostly called in the social sciences. In the context of information science and computational linguistics, server-based solutions such as brat, WebAnno or INCEpTION are powerful tools for complex text annotation tasks. In the social sciences, ATLAS.ti and MAXQDA dominate the market for the computer-assisted analysis of qualitative data. These are commercial products, to be installed locally, that offer a rich functionality for coding text. So why yet another tool for text analysis?
The open source annolite R package offers a lightweight fulltext display and text annotation toolset designed to be used with RStudio. Its functionality is much more limited than that of the powerful server-based solutions or the commercial products. But there is a set of common, straightforward text annotation scenarios with low-level technical requirements: using the computer for text annotation is then essentially like working with printed copies of texts and a set of highlighters, possibly writing comments in the margins with a pencil. For these scenarios, annolite offers sufficient functionality. If your annotation task is basic, annolite has the advantages that it is open source, easy to install, and designed to be integrated seamlessly into a pure R workflow.
The technical essence of the package is the htmlwidget annolite. The R functionality of the package exposes the htmlwidget (which is written in JavaScript) to the R environment and ensures the communication, i.e. the transfer of data, between R and JavaScript. A Shiny Gadget launched using the annotate() method can be used to annotate a text document within the RStudio environment. The gadget returns an annotationstable object which can easily be processed within R. Furthermore, annolite supports the transparency of the research process. The annolite htmlwidget can be used to embed the fulltext display of an annotated text in the analysis or the research report you write as an Rmarkdown document. Compared to existing tools for text annotation, the added value of annolite is that you have a seamless integration of simple text annotation tasks in a pure R workflow and that you create maximum transparency on the annotation step in a research project by embedding texts and annotations in an html document written in Rmarkdown.
There are two different uses of the annolite htmlwidget:
Annotation Mode: It can be used to create highlighter & pencil-style annotations. This scenario requires embedding the annolite htmlwidget into an interactive environment. The package includes a Shiny Gadget called through the annotate() function. The gadget returns an annotationstable object to be processed in the R session.
Display Mode: The annolite htmlwidget can be used to inspect a document and its annotations. The htmlwidget will (a) reconstruct the fulltext, (b) highlight tokens that have been annotated using the color that has been designated for a code and (c) include tooltips with the comment that has been written on an annotation. The htmlwidget can be displayed as an html page generated in the R session (using RStudio’s Viewer pane if you use the IDE), and it can be included in html documents generated from Rmarkdown. A special feature is that one single htmlwidget can report the annotations of multiple documents.
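The three steps of the display mode can be illustrated with a minimal base R sketch. It is not the package's JavaScript implementation: the function render_html() and the toy data are hypothetical, the column names (id, token, whitespace, color, annotation, start, end) follow the data structures used by the package, and HTML escaping is deliberately left out.

```r
# Toy token stream: one row per token; the whitespace is prepended to the token.
tokens <- data.frame(
  id = 1:6,
  token = c("Emma", "Woodhouse", ",", "handsome", ",", "clever"),
  whitespace = c("", " ", "", " ", "", " "),
  stringsAsFactors = FALSE
)

# Toy annotation covering the token ids 1 to 2.
annotations <- data.frame(
  code = "person", color = "yellow", annotation = "main character",
  start = 1L, end = 2L,
  stringsAsFactors = FALSE
)

# Hypothetical helper: (a) reconstruct the text, (b) colour annotated tokens,
# (c) attach the comment as a tooltip via the title attribute.
render_html <- function(tokens, annotations) {
  html <- tokens$token
  for (i in seq_len(nrow(annotations))) {
    hit <- tokens$id >= annotations$start[i] & tokens$id <= annotations$end[i]
    html[hit] <- sprintf(
      '<span style="background-color:%s" title="%s">%s</span>',
      annotations$color[i], annotations$annotation[i], html[hit]
    )
  }
  paste0(tokens$whitespace, html, collapse = "")
}

html <- render_html(tokens, annotations)
cat(html, "\n")
```

Stripping the tags from the result gives back the plain text, which is a quick way to check that the reconstruction step loses nothing.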
The annolite package is a GitHub-only package at this stage. You can install it from the GitHub presence of the PolMine Project as follows:
To install the development version of annolite, install the package from the dev branch of the repository.
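A sketch of how the installation might be scripted, assuming the remotes package is used and that the repository is PolMine/annolite (the path that appears in the issue tracker URL cited in this document). The wrapper function install_annolite() is hypothetical and only serves to keep both variants in one place.

```r
# Hypothetical convenience wrapper around remotes::install_github().
# dev = FALSE installs from the default branch, dev = TRUE from the "dev" branch.
install_annolite <- function(dev = FALSE) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    stop("Package 'remotes' is required: install.packages('remotes')")
  }
  ref <- if (dev) "dev" else "HEAD"
  remotes::install_github("PolMine/annolite", ref = ref)
}
```

Calling install_annolite() would install the default version, install_annolite(dev = TRUE) the development version.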
To check whether annolite has been installed successfully, load it in your R session.
annolite does not generate any messages by default when it is loaded. Note that the name of the annotate() method suggests itself for annotation tasks (which may be substantively very different), so the name is used by other packages, too. Most importantly, if you rely on the NLP package and have loaded it before loading annolite, you might see the message “The following object is masked from ‘package:NLP’: annotate”. In this case, prepend the package name followed by two colons to use the annotate() method of the annolite package, i.e. call annolite::annotate().
The package defines two different core classes to represent the fulltext of a document and associated annotations.
fulltextlist class
The S3 class fulltextlist is a list of lists with information on the content and the formatting of the chunks of text that make up a document. The class and its data structure are designed such that it can be generated from any kind of text input and transferred seamlessly to JavaScript. This is why fulltext is represented by a simple S3 class.
The lists that make up the fulltextlist class are the chunks of text that make up a text document. Each chunk is a list with the following building blocks:
element: A length-one character vector defining the html element that will define the layout of a chunk of text when it is assembled by the htmlwidget (e.g. “h1” for level 1 headline, “p” or “para” for plain paragraphs)
attributes: A named character vector defining attributes of the element and the values of the attributes. Attributes used at this stage are “style” with the values “display:block” or “display:none”, and “name”. The attribute “name” is relevant when multiple documents are displayed in one combined HTML element.
tokenstream: The actual text of each chunk of text (a headline, a paragraph, or any other region of text) is represented by a data.frame in a tokenized, tabular format. This data.frame needs to include a column id with a unique token id (used to match annotations and tokens), a column token with the word form to be displayed, and a column whitespace that includes the whitespace to be prepended to a token.
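To make the three building blocks concrete, the following base R sketch assembles a single chunk by hand and checks it against the requirements listed above. The validation helper is_valid_chunk() is hypothetical and not part of annolite; in practice the fulltextlist() method should be used to construct these objects.

```r
# A single chunk with the three building blocks: element, attributes, tokenstream.
chunk <- list(
  element = "p",
  attributes = c(style = "display:block", name = "Chapter 1"),
  tokenstream = data.frame(
    id = 1:5,
    token = c("Emma", "Woodhouse", ",", "handsome", "."),
    whitespace = c("", " ", "", " ", ""),
    stringsAsFactors = FALSE
  )
)

# Hypothetical check: length-one element, required columns, unique token ids.
is_valid_chunk <- function(chunk) {
  required <- c("id", "token", "whitespace")
  ts <- chunk$tokenstream
  is.character(chunk$element) && length(chunk$element) == 1L &&
    is.data.frame(ts) &&
    all(required %in% colnames(ts)) &&
    !anyDuplicated(ts$id)
}

# A document is a list of such chunks.
doc <- list(chunk)
all_valid <- all(vapply(doc, is_valid_chunk, logical(1)))
print(all_valid)
```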
The fulltextlist() method serves as the constructor for the fulltextlist class. It is flexible and facilitates working with annolite without strong assumptions on the data structure you need to supply. The method accepts different inputs. The most elementary input is a list of character vectors.
Jane Austen’s novel Emma is a popular text for teaching text analysis with R. It is part of the R package janeaustenr and is among the examples used for the tidy approach to text mining (see Julia Silge and David Robinson’s book Text Mining with R). The annolite package includes the book as sample data, a list of the chapters of Emma, each of which is a list of character vectors representing tokenized paragraphs. The data looks as follows.
## [1] "Emma" "Woodhouse" "," "handsome" ","
## [6] "clever" "," "and" "rich" ","
## [11] "with" "a" "comfortable" "home" "and"
## [16] "happy" "disposition" "," "seemed" "to"
## [21] "unite" "some" "of" "the" "best"
## [26] "blessings" "of" "existence" ";" "and"
## [31] "had" "lived" "nearly" "twenty" "-"
## [36] "one" "years" "in" "the" "world"
## [41] "with" "very" "little" "to" "distress"
## [46] "or" "vex" "her" "."
To turn the first chapter of emma into a fulltextlist, supply the fulltextlist() method with the list of character vectors with the tokenized paragraphs as input.
The object emma_ch1 is the list of chunks introduced before. To illustrate, we inspect the first chunk.
## name element style
## (3,8] para display:block
## tokenstream
## (3,8] Emma, Woodhouse, ,, handsome, ,, clever, ,, and, rich, ,, with, a, comfortable, home, and, happy, disposition, ,, seemed, to, unite, some, of, the, best, blessings, of, existence, ;, and, had, lived, nearly, twenty, -, one, years, in, the, world, with, very, little, to, distress, or, vex, her, ., -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
Having the fulltextlist object is enough to generate an htmlwidget that displays the fulltext.
To learn more about using the fulltextlist() method and its flexibility, consult the documentation (?fulltextlist).
The functionality to include fulltext based on the fulltextlist class in an Rmarkdown document may be convenient at times. But annolite is about annotating text and this requires a further data structure to manage annotations. This is the purpose of the annotationstable class.
annotationstable class
The annotationstable class is an S3 class that extends data.frame. So it is and behaves like a data.frame, but using the class is useful for being able to define specific methods. An annotationstable has the columns “text”, “code”, “color”, “annotation”, “start” and “end”. This captures all the information generated when annotating a document using annolite.
To convey what this table looks like, inspect the sample_annotation object that is included as sample annotation in the package.
## text
## 1 eine sehr verehrten Damen und Her
## 2 wierigsten Finanzund Wirtsc
## 3 Regierungskoalition
## 4 rbeitsplätze sorgt . Wir haben uns sehr intensiv Gedanken darüber gemacht , wie wir auch in sc
## code color annotation start end
## 1 orange orange Asdf 36137 36142
## 2 orange orange Uff 36175 36177
## 3 orange orange Gov 36260 36260
## 4 orange orange labor\n 36418 36434
Note that what you see in the column “text” is the text selection made when creating an annotation, which does not necessarily cover entire words. The token ids in this table are fairly high numbers because these are (arbitrary) annotations of a speech given by Volker Kauder that is part of GermaParl: the speech is a fairly small subset of a much larger corpus.
To learn more about the annotationstable, see the documentation of the annotationstable() function (?annotationstable) that serves as a constructor of an (empty) annotationstable.
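The idea of a data.frame with an additional S3 class can be sketched in base R as follows. The class name annotations_sketch, the helper as_annotations_sketch() and the toy data are assumptions made for illustration; they are not the package's own constructor, which is annotationstable().

```r
# Hypothetical helper: check the expected columns and prepend an S3 class.
as_annotations_sketch <- function(df) {
  expected <- c("text", "code", "color", "annotation", "start", "end")
  stopifnot(is.data.frame(df), all(expected %in% colnames(df)))
  class(df) <- c("annotations_sketch", "data.frame")
  df
}

# A class-specific method: count the annotations per code.
summary.annotations_sketch <- function(object, ...) {
  table(object$code)
}

# Toy data, not taken from any corpus.
ann <- as_annotations_sketch(data.frame(
  text = c("Emma Woodhouse", "Hartfield", "Mr. Knightley"),
  code = c("person", "place", "person"),
  color = c("yellow", "lightgreen", "yellow"),
  annotation = c("main character", "", ""),
  start = c(1L, 120L, 350L),
  end = c(2L, 120L, 352L),
  stringsAsFactors = FALSE
))

counts <- summary(ann)  # dispatches to summary.annotations_sketch()
print(counts)
print(nrow(ann))        # data.frame functions keep working
```

The object still passes inherits(ann, "data.frame"), so anything that works on a data.frame works on it, while the extra class allows dedicated methods.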
To annotate text documents using annolite, the package includes a Shiny Gadget that has the annolite htmlwidget at its core. The gadget is called using the function annotate(), with a fulltextlist of the document to be annotated as input. If annotations have already been made, supply the respective annotationstable as argument annotations. The gadget is designed to be used in an interactive RStudio session. It will be run in the Viewer pane of RStudio and the annotationstable with your (new) annotations will be returned when the gadget is closed.
The example aims at outputting one particular speech. We take a speech held by Volker Kauder in the German Bundestag.
library(polmineR) # provides corpus(); GERMAPARLMINI is its sample corpus
library(magrittr) # provides the pipe operator %>%
kauder_speech <- corpus("GERMAPARLMINI") %>%
  subset(speaker == "Volker Kauder" & date == "2009-11-10")
The data passed to the JavaScript that generates the output is expected to be a list of lists that provide data on sections of text. Each of the sub-lists is to be a named list with a character vector stating the HTML element the section will be wrapped into, and a data.frame (or a list) with a column “token” and a column “id”.
Everything is prepared now to call the htmlwidget.
To put the htmlwidget to real use, it needs to be embedded into a Shiny Gadget that will administer the interface between R and JavaScript and that will prepare and return an annotationstable to the R session. The Shiny Gadget is launched by calling the annotate() on the fulltextlist for a document.
To tailor the buttons according to your needs, use dialog_radio_buttons(), see the following code.
dialog <- list(
  choices = dialog_radio_buttons(
    organization = "yellow",
    document = "lightgreen"
  )
)
Now everything is ready to launch the gadget.
It is not possible to embed a Shiny Gadget into a standalone html document. The following “movie” conveys a sense of the annotation process.
Whatever tool is used for annotating documents, creating annotations requires theoretical guidance, reflection and time. Annotations are a valuable research resource. Making documents and their annotations accessible for inspection, questioning and improvement is an important step towards transparency and quality control. The display mode of the annolite htmlwidget has this purpose in mind.
To use annolite for displaying documents with annotations, you supply the document as a fulltextlist, you do not define a dialog for the interactive annotation part (argument dialog is NULL), and you hand over an annotationstable via the annotations argument.
The code to create an annolite htmlwidget and the result can be embedded in Rmarkdown documents (including slides) easily. You can send around the html document or put it online (e.g. via GitHub Pages). Given the almost universal access to browsers that will be able to process the html documents you generate, barriers for creating maximum transparency on your research are minimal. Nobody will have to install a server to inspect and check the annotations you and your team made. It is not necessary to acquire access to proprietary software before your annotations can be looked at.
The package includes a minimal dataset with annotations on the speech of Volker Kauder that served as an example initially (object sample_annotation). This is how you create the htmlwidget and include it in the document.
kauder_speech_flist <- corpus("GERMAPARLMINI") %>%
  subset(speaker == "Volker Kauder" & date == "2009-11-10") %>%
  fulltexttable()
annolite(kauder_speech_flist, annotations = sample_annotation)
Annotation will rarely be limited to a single document. Usually, a larger collection of documents will be annotated. It would be tedious having to write the code to prepare the htmlwidget for every single document. Yet you can combine multiple htmlwidgets into one browsable HTML element in which they communicate with each other via a mechanism introduced with the crosstalk package. The annolite htmlwidget is crosstalk-enabled and provides the possibility to inspect multiple annotated documents in one HTML element.
Technically, the HTML element you can use to inspect multiple annotated documents will consist of a datatables htmlwidget as created by DT::datatable(), and an annolite htmlwidget. The table will list the names of the documents. By clicking a document name, the respective annotated document will be displayed.
The annolite() method will prepare this output. But this time, the input needs to be a list of fulltextlist objects with defined names. We use the chapters of emma as an example and start with preparing the list of fulltextlist objects. Each of these objects in the list will include the fulltext of a chapter.
The interaction between the table and the annolite htmlwidget requires that the name component is a non-empty character vector. So we iterate through the chapters again and fill the name component of the fulltextlist objects.
# Assign "Chapter 1", "Chapter 2", ... as the name of each fulltextlist object
emma_chapters <- lapply(
  seq_along(emma_chapters),
  function(i){
    emma_chapters[[i]][["name"]] <- sprintf("Chapter %s", i)
    emma_chapters[[i]]
  }
)
This is an example where you will find the name …
## [1] "Chapter 2"
The input required for an html element with a table to select documents and a document viewer is fully prepared. In this example, we limit the input to the first five chapters of Emma to be somewhat parsimonious with the data to be included in the vignette.
The original intention behind the development of annolite is to offer a lightweight open source tool for a pure R workflow to create and display annotations. The widget is sufficiently flexible to allow alternative uses. You can also use it as a viewer to highlight tokens or token sequences, for instance the matches for any kind of dictionary that is relevant for your work.
The display mode of the annolite widget can be used to qualitatively evaluate how a topic model works. In Blei’s original work on topic models, there is a very telling snippet of a text highlighting the words that indicate three different topics (Blei 2012, 78).
The annolite widget can get you this kind of output, which may be very useful to evaluate topic models. In the example we use to illustrate this usage, a topic model (k = 180) is computed for the speeches in the United Nations General Assembly. Topic 105 might be indicative of a thematic focus on international migration. To evaluate this, the following HTML element can be used: you can skip through the speeches that went into the computation of the topic model, and the terms indicative of topic 105 (first 50 terms only) are highlighted in yellow. Note that to maintain a reasonable size of the vignette, only 25 documents are included here.
annolite(
  x = unga_migrationspeeches_fulltext,
  annotations = unga_migrationspeeches_anntationstable,
  group = "unga"
)
To limit the number of dependencies, to avoid having to download somewhat large data when preparing the vignette, and to limit computation time, the objects unga_migrationspeeches_fulltext and unga_migrationspeeches_anntationstable are included as sample data in the package. The code used is in the “data-raw” folder.
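How a list of terms might be turned into a table of annotations can be sketched in base R. This is not the code from the “data-raw” folder: the helper dictionary_to_annotations(), the toy token stream and the term list are hypothetical, and only the column layout (text, code, color, annotation, start, end) follows the annotationstable class.

```r
# Toy token stream with token ids, as in the tokenstream of a fulltext chunk.
tokens <- data.frame(
  id = 101:108,
  token = c("Migration", "is", "a", "global", "challenge", "for", "refugees", "."),
  stringsAsFactors = FALSE
)

# Toy dictionary, standing in for the top terms of a topic.
top_terms <- c("migration", "refugees", "migrants")

# Hypothetical helper: every token matching the dictionary becomes a
# one-token annotation (start and end are the same token id).
dictionary_to_annotations <- function(tokens, terms, code, color) {
  hit <- tolower(tokens$token) %in% tolower(terms)
  data.frame(
    text = tokens$token[hit],
    code = rep(code, sum(hit)),
    color = rep(color, sum(hit)),
    annotation = rep("", sum(hit)),
    start = tokens$id[hit],
    end = tokens$id[hit],
    stringsAsFactors = FALSE
  )
}

ann <- dictionary_to_annotations(tokens, top_terms, code = "topic_105", color = "yellow")
print(ann)
```

Matching is case-insensitive and limited to single tokens here; multi-token dictionary entries would need start and end ids that span several tokens.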
Including an annolite htmlwidget as such, or an HTML element that integrates it, into an HTML presentation may be a great way to explain or document research. However, generating an ioslides presentation from Rmarkdown (Xie, Allaire, and Grolemund 2018), which is flexible and comfortable in general, will usually entail some awkward CSS conflicts when it comes to including an annolite htmlwidget. The best practice we could identify is to wrap the HTML element that includes an annolite htmlwidget into a “widgetframe”, an approach introduced with the widgetframe package. Technically, the htmlwidget will be put into an iframe, thus containing it and avoiding CSS conflicts.
A template for creating slides included in the package provides further explanations and includes the relevant code. The template is available from RStudio via the menu: File > New File > R Markdown > From Template > slides {annolite}.
The implementation of the annolite htmlwidget interprets the original intentions of the authors of htmlwidgets and the crosstalk functionality quite freely in two respects:
The original intention of htmlwidgets is to provide “a framework for creating R bindings to JavaScript libraries” (see https://www.htmlwidgets.org/develop_intro.html). The JavaScript code of htmlwidgets is usually a binding for the existing underlying JavaScript library. In the case of annolite, the JavaScript code has been written exclusively for the annolite package. This is not what most htmlwidgets do, but it does not involve dirty hacks and is possible in a straightforward manner.
The situation is somewhat different when it comes to the way the annolite htmlwidget is crosstalk-enabled: Crosstalk is designed to work with data frames, or data frame-like objects (see https://rstudio.github.io/crosstalk/authoring.html). The fulltextlist object that is passed to the widget is a nested list, not a tabular data format, and there is a set of implicit choices as to which columns of the data connect widgets. As a consequence, a part of the crosstalk functionality (filter menus, most importantly) cannot be used. We hope that a future version of annolite will approximate the requirements of crosstalk in a more dataframish manner (see https://github.com/PolMine/annolite/issues/8).
There is a mixture of the S3 and the S4 system of object-oriented programming within the package: The S4 method fulltextlist() is the constructor for the S3 fulltextlist class. This hybrid and potentially confusing state of affairs has one great advantage: The S3 fulltextlist class can be converted to a JSON string that is passed to JavaScript directly, whereas some intermediate data transformation would be necessary if the fulltextlist was an S4 class. Hopefully, being aware of this background will ease the irritation of users.
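The difference can be illustrated with a small sketch using toy classes (fulltext_sketch and FulltextSketchS4 are hypothetical names, not classes defined by annolite). An S3 object built on a list is still a plain list underneath, whereas the content of an S4 object has to be pulled out of a slot before a list-based serializer could work with it.

```r
library(methods)

# S3: a list with a class attribute; the underlying data is directly accessible.
s3_doc <- structure(
  list(list(element = "p", tokens = c("Emma", "Woodhouse"))),
  class = "fulltext_sketch"
)
print(is.list(s3_doc))  # TRUE

# S4: the same content sits in a slot of a formally defined class.
setClass("FulltextSketchS4", representation(chunks = "list"))
s4_doc <- new("FulltextSketchS4", chunks = unclass(s3_doc))
print(is.list(s4_doc))  # FALSE

# Hypothetical helper: the S4 object needs an explicit extraction step,
# the S3 object only needs its class attribute dropped.
as_plain_list <- function(x) {
  if (isS4(x)) x@chunks else unclass(x)
}

print(identical(as_plain_list(s3_doc), as_plain_list(s4_doc)))  # TRUE
```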
For complex annotation tasks: INCEpTION
Blei, David M. 2012. “Probabilistic Topic Models.” Commun. ACM 55 (4): 77–84. https://doi.org/10.1145/2133806.2133826.
Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97. https://doi.org/10.1093/pan/mps028.
Moretti, Franco. 2013. Distant Reading. London: Verso.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.